The files in this archive are the raw data, processed data, R script, and related materials used in the 2013 English Wikipedia plagiarism research project: https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Education_Program/Research/Plagiarism

These are the checked cohort datasets and their corresponding csv files:

taskus a.csv = S new
taskus b.csv = 2006
taskus c.csv = 2009
taskus d.csv = 2012
taskus e.csv = S 2013
taskus f.csv = match
taskus g.csv = S expand
taskus h.csv = active
taskus i.csv = admins

The corresponding raw data received from TaskUs is contained in the .ods spreadsheet. To begin from that data, save each sheet as a .csv named 'taskus a.csv' and so on.
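
For example, once the sheets are saved out, the checked cohort csv files can be loaded into R in one step. This is only a minimal sketch, assuming the csv files sit in the working directory and use the names listed above:

    # Read each checked cohort csv into a named list of data frames
    cohort.files <- c("S new"    = "taskus a.csv",
                      "2006"     = "taskus b.csv",
                      "2009"     = "taskus c.csv",
                      "2012"     = "taskus d.csv",
                      "S 2013"   = "taskus e.csv",
                      "match"    = "taskus f.csv",
                      "S expand" = "taskus g.csv",
                      "active"   = "taskus h.csv",
                      "admins"   = "taskus i.csv")
    cohorts <- lapply(cohort.files, read.csv, stringsAsFactors = FALSE)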

badlinks.txt is a list of websites, formatted as regular expressions, that mirror Wikipedia articles and appear frequently in the raw data. The R script uses it to remove hits to those sites as false positives.
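
As an illustration of that filtering step (not the exact code from taskus cleanup 4.R), dropping rows whose URLs match any pattern in badlinks.txt could look like the following, where 'hits' stands for a hypothetical data frame of checked results with a 'url' column:

    # Read the mirror-site patterns and drop any hit whose url matches one of them
    badlinks <- readLines("badlinks.txt")
    is.mirror <- Reduce(`|`, lapply(badlinks, grepl, x = hits$url))
    hits.clean <- hits[!is.mirror, ]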

taskus cleanup 4.R is the script used to process and graph the data (as of 26 September 2013). Earlier versions of the script are also included, but they are not guaranteed to work.

-Sage Ross
2013-09-26
